Term Frequency Models

This notebook demonstrates how to use the Corpus object to train and view a term frequency (TF) or "word count" model, a term frequency-inverse document frequency (tf-idf) model, and a Latent Semantic Analysis (LSA) model, and how to view topics in the LDA models trained through the Topic Explorer.

To run the notebook, use the menu Cell -> Run All, or use the play button to run one cell at a time.


In [ ]:
# First we load the vsm module and import your corpus; the import
# provides c, context_type, and the trained LDA viewers in lda_v
from vsm import *
from corpus import *

Term Frequency Models

The term frequency model is the most primitive of all vector space models: for each document, we simply count the number of occurrences of each word.

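As a minimal sketch of what this computes (plain Python, independent of vsm; the toy documents below are made up):


In [ ]:
# count word occurrences per document with a plain Counter --
# the same bag-of-words representation the TF model builds
from collections import Counter

toy_docs = ["the cat sat on the mat", "the dog sat"]
counts = [Counter(doc.split()) for doc in toy_docs]
counts[0]["the"]  # 2: "the" occurs twice in the first document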

In [ ]:
# initialize and train the term frequency model, then create a viewer for it
tf = TF(c, context_type)
tf.train()
tf_v = TfViewer(c, tf)

In [ ]:
# show the most frequent terms across the whole corpus (collection frequencies)
tf_v.coll_freqs()

tf-idf Models

In the tf-idf model, each term's raw count in a document is weighted by how rare the term is across the collection, so that ubiquitous words such as "the" no longer dominate the scores.

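For intuition, here is the standard weighting scheme as a quick sketch (the toy counts are made up, and vsm's exact formulation may differ in detail):


In [ ]:
# sketch of the standard tf-idf weighting:
#   tfidf(t, d) = tf(t, d) * log(N / df(t))
# where N is the number of documents and df(t) is the number of
# documents containing term t
import numpy as np

tf_matrix = np.array([[2, 0, 1],    # rows = terms
                      [1, 1, 0]])   # columns = documents
N = tf_matrix.shape[1]
df = (tf_matrix > 0).sum(axis=1)    # document frequency of each term
tf_matrix * np.log(N / df)[:, np.newaxis]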

In [ ]:
# initialize the tf-idf model from the TF model, train it, and create a viewer
tfidf = TfIdf.from_tf(tf)
tfidf.train()
tfidf_v = TfIdfViewer(c, tfidf)

LSA Models

Latent Semantic Analysis applies a truncated singular value decomposition to the term-document matrix, representing terms and documents in a low-dimensional space where terms that occur in similar documents end up close together.

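The reduction at the heart of LSA is just a truncated SVD; here is a minimal numpy sketch (the matrix is a random stand-in, not your corpus):


In [ ]:
# truncated SVD: keep only the k largest singular values and
# their vectors, giving a rank-k approximation of the matrix
import numpy as np

A = np.random.rand(100, 30)      # stand-in term-document matrix
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
A_k.shape                        # same shape as A, but rank k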

In [ ]:
# initialize the LSA model from the TF model, train it with
# 50 singular-value factors, and create a viewer
lsa = Lsa.from_tf(tf)
lsa.train(k_factors=50)
lsa_v = LsaViewer(c, lsa)

LDA Models

The trained LDA models are loaded automatically into the dictionary lda_v, keyed by the number of topics: lda_v[k] is the viewer for the k-topic model.

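Since lda_v is an ordinary dictionary, you can list which topic counts were trained:


In [ ]:
# list the available LDA models by their number of topics
sorted(lda_v.keys())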

In [ ]:
# display the topics of the 20-topic LDA model
lda_v[20].topics()

In [ ]:
# keep a handle on the 20-topic viewer for the comparison below
v = lda_v[20]

In [ ]:
# compare which documents each model considers most similar to a
# query word, across the TF, tf-idf, LSA, and LDA models
import numpy as np
words = ['moral']
# distances from the query word to each LDA topic
b = np.array(v.dist_word_top(words, show_topics=False))
sim_docs = [tf_v.dist_word_doc(words),
            tfidf_v.dist_word_doc(words),
            lsa_v.dist_word_doc(words),
            # weight each topic by its similarity (1 - distance)
            v.dist_top_doc(b['i'], weights=np.ones_like(b['value']) - b['value'])]
# the ten most similar documents under each model
[docs[:10]['doc'] for docs in sim_docs]
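
Each list shows the ten documents closest to the query word under one of the models, so you can compare side by side how differently the models rank the same documents.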